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Abstract 

Approximate dynamic programming has 
been used successfully in a large variety of do- 
mains, but it relies on a small set of provided 
approximation features to calculate solutions 
reliably. Large and rich sets of features can 
cause existing algorithms to overfit because 
of a limited number of samples. We address 
this shortcoming using Li regularization in 
approximate linear programming. Because 
the proposed method can automatically se- 
lect the appropriate richness of features, its 
performance does not degrade with an in- 
creasing number of features. These results 
rely on new and stronger sampling bounds 
for regularized approximate linear programs. 
We also propose a computationally efficient 
homotopy method. The empirical evalua- 
tion of the approach shows that the proposed 
method performs well on simple MDPs and 
standard benchmark problems. 



1. Introduction 

Solving large Markov Decision Processes (MDPs) is 
a very useful, but computationally challenging prob- 
lem addressed widely by reinforcement learning. It 
is widely accepted that large MDPs can only be 
solved approximately. This approximation is com- 
monly done by relying on linear value function approx- 
imation, in which the value function is chosen from a 
small-dimensional vector space of features. While this 
framework offers computational benefits and protec- 
tion from the overfitting in the training data, selecting 
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an effective, small set of features is difficult and re- 
quires a deep understanding of the domain. Feature 
selection, therefore, seeks to automate this process in 
a way that may preserve the computational simplicity 
of linear approximation (Parr et al., 2007; Mahadevan, 
2008). We show in this paper that Li-regularized ap- 
proximate linear programs (RALP) can be used with 
very rich feature spaces. 

RALP relies, like other value function approximation 
methods, on samples of the state space. The value 
function error on states that arc not sampled is known 
as the sampling error. This paper shows that regular- 
ization in RALP can guarantee small sampling error. 
The bounds on the sampling error require somewhat 
limiting assumptions on the structure of the MDPs, as 
any guarantees must, but this framework can be used 
to derive tighter bounds for specific problems in the 
future. The relatively simple bounds can be used to 
determine automatically the regularization coefficient 
to balance the expressivity of the features with the 
sampling error. 

We derive the approach with the Li norm, but it could 
be used with other regularizations with small modifi- 
cations. The Li norm is advantageous for two main 
reasons. First, the Li norm encourages the sparse so- 
lutions, which can reduce the computational require- 
ments. Second, the Li norm preserves the linearity 
of RALPs; the L2 norm would require quadratic opti- 
mization. 

Regularization using the Li norm has been widely 
used in regression problems by methods such 
as LASSO (Tibshirani, 1996) and Dantzig selec- 
tor (Candes & Tao, 2007). The value-function approx- 
imation setting is, however, quite different and the re- 
gression methods are not directly applicable. Regular- 
ization has been previously used in value function ap- 
proximation (Taylor & Parr, 2009; Farahmand et al.. 
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2008; Kolter & Ng, 2009). In comparison with LARS- 
TD (Kolter & Ng, 2009), an Li regularized value func- 
tion approximation method, we explicitly show the in- 
fluence of regularization on the sampling error, pro- 
vide a well-founded method for selecting the regu- 
larization parameter, and solve the full control prob- 
lem. In comparison with existing sampling bounds for 
ALP (de Farias & Van Roy, 2001), we do not assume 
that the optimal policy is available, make more general 
assumptions, and derive bounds that are independent 
of the number of features. 

Our approach is based on approximate linear program- 
ming (ALP), which offers stronger theoretical guaran- 
tees than some other value function approximation al- 
gorithms. We describe ALP in Section 3 and RALP 
and its basic properties in Section 4. RALP, unlike 
ordinary ALPs, is guaranteed to compute bounded so- 
lutions. We also briefly describe a homotopy algo- 
rithm for solving RALP, which exhibits anytime be- 
havior by gradually increasing the norm of feature 
weights. To develop methods that automatically se- 
lect features with generalization guarantees, we pro- 
pose general sampling bounds in Section 5. These 
sampling bounds, coupled with the homotopy method, 
can automatically choose the complexity of the fea- 
tures to minimize over-fitting. Our experimental re- 
sults in Section 6 show that the proposed approach 
with large feature sets is competitive with LSPI when 
performed even with small feature spaces hand se- 
lected for standard benchmark problems. Section 7 
concludes with future work and a more detailed rela- 
tionship with other methods. 

2. Framework and Notation 

In this section, we formally define Markov decision 
processes and linear value function approximation. 
A Markov Decision Process is a tuple (S,A,P,r,j), 
where S is the possibly infinite set of states, and A 
is the finite set of actions. P : 5 x 5 x ^ t— !> [0, 1] is 
the transition function, where P(s', s, a) represents the 
probability of transiting to state s' from state s, given 
action a. The function r : 5 x ^ h-> M is the reward 
function, and 7 is the discount factor. Pa and r^ are 
used to denote the probabilistic transition matrix and 
reward vector for action a. 

We are concerned with finding a value function v 
that maps each state s € S to the expected total 7- 
discounted reward for the process. Value functions can 
be useful in creating or analyzing a policy it : S x A ^ 
[0,1] such that for all s & S, J2aeA^i^'^) ~ ^- "^^^ 
transition and reward functions for a given policy are 
denoted by Pj^ and r^r. The value function update for 



a policy n is denoted by Lj^, and the Bellman operator 
is denoted by L. That is: 



L^v = •yP^v 



Lv 



max LttW. 
Tren 



The optimal value function v* satisfies Lv* ~ v* . 

We focus on linear value function approximation for 
discounted infinite-horizon problems, in which the 
value function is represented as a linear combination 
of nonlinear basis functions (vectors). For each state 
s, we define a vector (j){s) of features. The rows of the 
basis matrix $ correspond to (/)(s), and the approxima- 
tion space is generated by the columns of the matrix. 
That is, the basis matrix $, and the value function v 
are represented as: 



$ 



^1)"^ 



$ii). 



This form of linear representation allows for the cal- 
culation of an approximate value function in a lower- 
dimensional space, which provides significant compu- 
tational benefits over using a complete basis; if the 
number of features is small, this framework can also 
guard against overfitting noise in the samples. 

Definition 1. A value function, v, is representable if 
V e M <^ M'"^', where M = colspan(<I>). The set of 
e-transitive- feasible value functions is defined for e > 
as follows: /C(e) = {w € RI'^I \v > Lv - el. Here 1 
is a vector of all ones. A value function is transitive- 
feasible when v > Lv and the set of transitive-feasible 
functions is defined as /C = /C(0). 

Notice that the optimal value function v* is transitive- 
feasible, and that A^ is a linear space. 

3. Approximate Linear Programming 

The approximate linear programming (ALP) frame- 
work is an approach for calculating a value function 
approximation for large MDP with a set of features $ 
that define a linear space Ad (Schweitzer & Seidmann, 
1985; de Farias & Van Roy, 2003). The ALP takes the 
following form: 



s.t. r{s, a) + 7 NJ P{s' , s, a)v{s') < v(s) 



s'es 



where p is a distribution over the initial states and 
the constraints are for all {s,a) G {S,A); that is 
^s^S p{s) = 1. To ensure feasibility, one of the 
features is assumed to be constant. Therefore, in 
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the remainder of the paper, we make the fohowing 
standard assumption (Schweitzer & Seidmann, 1985), 
which can be satisfied by setting the first cohimn of 
M to 1. 

Assumption 2. For all fc G M, we have that fc-1 G M, 
where 1 is a vector of all ones. 



For simplicity and generality of notation, we use C to 
denote the ALP constraint matrix, so Cv < u is equal 
to the set of constraints {LaV < v : \/a E A}. Then, 
we can rewrite the ALP as follows: 



mm p V 

V 

s.t. Cv < V V e M 



(1) 



Notice that the constraints in the ALP correspond 
to the definition of transitive-feasible functions in 
Definition 1. A succinct notation of the ALP con- 
straints can then use the set of transitive-feasible func- 
tions as V G MnJC. 

The constraint v E A4 implies that v = ^w and there- 
fore the number of variables in (1) corresponds to the 
number of features. Typically, this is a small num- 
ber. However, the number of required constraints in 
ALP is |<S| X 1^1, which is oftentimes impractically 
large or infinite. The standard solution is to sample a 
small set of constraints according to a given distribu- 
tion (de Farias & Van Roy, 2003). It is then possible 
to bound the probability of violating a randomly cho- 
sen constraint. There are, however, a few difficulties 
with this approach. First, leaving constraints out can 
lead to an unbounded linear program. Second, in prac- 
tice the distribution over the constraints can be very 
different from the distribution assumed by the theory. 
Finally, the bound provides no guarantees on the so- 
lution quality. 

ALP has often under-performed ADP methods 
in practice; this issue has been recently studied 
and partially remedied (Petrik & Zilberstein, 2009; 
Desai et al., 2009). Because these methods are inde- 
pendent of the proposed modifications, we only focus 
on standard approximate linear programs. 

We show next that RALP with sampled constraints 
not only guarantees that the solution is bounded and 
provides worst-case error bounds on the value function, 
but also is independent of the number of features. As 
a result, the ALP formulation does not require a small 
number of features to be selected in advance. 



4. Regularized Approximate Linear 
Programming 

In this section, we introduce Li -regularized ALP 
(RALP) as an approach to automate feature selection 
and alleviate the need for all constraints in standard 
ALP. Adding Li regularization to ALP permits the 
user to supply an arbitrarily rich set of features with- 
out the risk of overfitting. 

The RALP for basis $ and Li constraint ijj is defined 
as follows: 



min p <f>w 

w 
s.t. jC^W < (f>W ||iu||l,e < ■0, 



(2) 



where ||u;||i,e = X^i '3(*)w(*)- Note that RALP is a 
generalization of ALP; when "0 approaches infinity, the 
RALP solution approaches the ALP solution. The ob- 
jective value of (2) as a function of i/' is denoted as 

We generally use e = 1 _ i , which is a vector of all ones 
except the first position, which is 0; because the first 
feature is the constant feature, we do not include it 
in the regularization. The main reasons for excluding 
the constant feature are that the policy is independent 
of the constant shifts, and the homotopy method we 
propose requires that the linear program is easy to 
solve when -0 = 0. 

Alternatively, we can formulate RALP in (2) as a mi- 
nor modification of ALP in equation (1). This is by 
modifying M to satisfy the Li norm as: 

X(0) = {$w|||u;||i,e <^}. 

Notice that RALP introduces an additional parameter 
■0 over ALP. As with Li regularization for regression, 
this raises some concerns about a method for choosing 
the regularization parameter. Practical methods, such 
as cross-validation may be used to address this issue. 
We also propose an automated method for choosing 
in Section 5 based on the problem and sampling pa- 
rameters. 

5. Sampling Bounds 

The purpose of this section is to show that RALP of- 
fers two main benefits over ALP. First, even when the 
constraints are sampled and incomplete, it is guar- 
anteed to provide a feasible solution. Since feasibil- 
ity does not imply that the solution is close to op- 
timal, we then show that under specific assumptions 
— such as smooth reward and transition functions — 
RALP guarantees that the error due to the missing 
constraints is small. 
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To bound the error from sampling, we must formally 
define the samples and how they are used to construet 
ALPs. We eonsider the following two types of samples 
S and E defined as follows. 

Definition 3. One-step simple samples are defined as 
follows: E C {(s, a, (si . . . Sn),r(s, a))\s G S, a G A}, 
where si . . . s„ are selected i.i.d. from the distribu- 
tion P(s, a, •). One-step samples with expectation are 
defined as follows: E C {{s,a,P{s,a,-),r{s,a)) \ s S 
5, a e A}. 

Often the samples only include state transitions, as E 
defines. The more informative samples E include the 
probability distribution of the states that follow the 
given state and action, as follows: 

Liv)is) = r(s, a) + 7 ^ P(s, a, s'^s'), (3) 

s'£S 

where {s,a, P{s,a, ■),r{s,a)) € E. The less- 
informative E can be used as follows: 



1 

i(z;)(s) =r(S,a)+7- Vw(s,), (4) 

n ^-^ 



where (s, a, {s\. . . s„), ?'(s, a)) G E. The corresponding 
transitive-feasible sets K, and K. are defined similarly. 
The ALPs can be constructed based on samples as 
Figure 1 shows. Full ALP corresponds to the RALP 
formulation in (2), when A^ is constricted with L\ 
regularization. In comparison, sampled ALP is missing 
some of the constraints while estimated ALP is both 
missing some constraints, and the included constraints 
may be estimated imprecisely. 

The following two assumptions quantify the behavior 
of the ALP with respect to missing and imprecise con- 
straints respectively. The first assumption limits the 
error due to missing transitions in the sampled Bell- 
man operator L. 

Assumption 4 (Constraint Sampling Behavior). The 
representable value functions satisfy that: 

/c n M{^) cjcnMitij)c /c(ep) n M{i^), 

and that for all representable value functions v £ 
M{ip) we have that \{p — p^v] < edip). 

The constant e^ bounds the potential violation of the 
ALP constraints on states that are not provided as a 
part of the sample. In addition, all value functions that 
are transitive-feasible for the full Bellman operator are 
transitive-feasible in the sampled version; the sampling 
only removes constraints on the set. The constant Ec 
essentially represents the maximal error in estimating 



the objective value of ALP for any representable value 
function. 

The next assumption quantifies the error on the es- 
timation of the transitions of the estimated Bellman 
operator L. 

Assumption 5 (Constraint Estimation Behavior). 
The representable value functions satisfy that: 

^(-e,) n 7W(^) C £ nMiij) C £(e,,) DMii:), 

where E and E (and therefore K- and JC) are defined 
for identical sets of states. 

These assumptions are quite generic in order to apply 
in a wide range of scenarios. The main idea behind 
the assumptions is to bound by how much a feasible 
solution in the sampled or estimated ALP can violate 
the true ALP constraints. These assumptions may be 
easily satisfied, for example, by making the following 
Lipschitz continuity assumptions. 

Assumption 6. Let k : S ^ M" be a map of the 
state-space to a normed vector space. Then for all 
x,y,z £ S and all features (columns) 4>i G <&, we define 
Kr, Kp, and K^ such that 

\r[x)-T{v)\<Kr\\k{x)-k[y)\\ 

\p{z\x,a) -p{z\y,a)\ < Kp\\k{x) - k{y)\\ 

\M^) - Uy)\ < K4k{x) - k{y)\\ 

Proposition 7. Assume Assumption 6 and that for 
any s G 5 there exists a state s 6 E such that ||s — s|| < 
c. Then Assumption 4 cind Assumption 5 hold with 
ep{ip) = cKr 4- cip{K,f, + ^Kp) 

Because of the technical nature of the proof, it is de- 
ferred to the appendix. 

The importance of this bound is that the violation on 
constraints that were not sampled grows linearly with 
the increasing coefficient ip. As we show below, this 
fact can be used to determine the optimal value of ip. 
For the sake of brevity, we do not discuss the estima- 
tion error bounds e,, in more detail, which can be eas- 
ily derived from existing results (Pctrik & Zilberstein, 
2009). 

We are now ready to state the following general bounds 
on the approximation error of a RALP. 

Theorem 8 (Offline Error Bound). Assume Assump- 
tions 2, 4j o-'nd 5. Let ii, v, v be the optimal solutions 
of (5), (6), and (7), respectively (see Figure 1). Let 
e = j^ niin^g^(^) \\v — w*||oo Then, the following in- 
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Full ALP 



L^E' 



Sampled ALP 

p = m E -^(^ 



mm p V 

V 

s.t. V e K. 



sSS 



{s,...)es 



V e M{tp) 



(5) 



mm p V 

V _ 

s.t. V e K. V £ Miip) 



(6) 





Estimated ALP 




1 


1^ <^(s) 

(s,...)eE 


min 


p-v 




s.t. 


veic 


V G >!(■! 



(7) 



Figure 1. Constructing ALP From Samples 



equalities hold: 



|w-u*||i.p < e 



\v^v*\\i.p<e + 2ecW + 2 " 



(V') 



1-7 

3£,(V>)+2£pW 

1-7 



Because of the technical nature of the proof, it is de- 
ferred to the appendix. 

Notice that because the weight p is often chosen arbi- 
trarily, the bounds may be simply derived for || • ||i_p. 
In that case, Cc ~ 0. Unlike most of the existing ALP 
bounds, we focus on bounding the error of the value 
function, instead of bounding the number of violated 
constraints. 

Consider the implications of these bounds combined 
with the Lipschitz assumptions of Proposition 7. It is 
clear that reducing -0 tightens Theorem 8, but causes 
the set M.{ip) to shrink and become more restrictive; 
this suggests a tradeoff to be considered when set- 
ting the regularization parameter. The bound also 
illustrates the importance of covering the space with 
samples; as the distance between the samples c ap- 
proaches zero, the bound tightens. In short, the Lip- 
schitz assumptions limit how quickly the constraints 
can change between sampled states. As sampled 
states get closer or the reward, feature, and probabil- 
ity functions become smoother, constraints between 
samples become more and more restricted; however, 
smoother basis functions may mean a less expressive 
space. Similar tradeoffs are likely to appear however 
Assumption 4 and Assumption 5 arc fulfilled. 

The offline error bounds in Theorem 8 can be used to 
guarantee the performance of a RALP for a fixed num- 
ber of samples and the regularization coefficient ip. It 
does not, however, prescribe how to choose the regu- 
larization coefficient for a given set of samples. To do 
that, we have to derive bounds for an actual value func- 
tion V. When the samples are known, these bounds are 
typically tighter than the offiine error bound. 



§4 

o 

m 



k\. Global n 


linimum 


V — v*^ ■^ ^ 

•••• vl' 

V 


— V 


V — V*-- 

■ri" 

^ ^ ^ 



5 



10 



Figure 2. Sketch of error bounds as a function of the reg- 
ularization coefficient. Here, v is the value function of the 
full ALP, V is the value function of the estimated ALP, and 
V* is the optimal value function. 

Theorem 9 (Online Error Bound). Assume 
Assumption 2 and let v G /C fl A4{ip) be an arbi- 
trary feasible solution of the estimated ALP (7). 
Then: 



li,P 



< P^v 



T * 
p V 



M 



.W+epW 



Because of the technical nature of the proof, it is de- 
ferred to the appendix. 

Here we briefly introduce a homotopy method that not 
only speeds the computation of the RALP solution, 
but also can be used to find the optimal value of ip. 
Because the homotopy method only relies on standard 
linear programming analysis and is somewhat techni- 
cal, the detailed description is provided in Appendix B 
in the appendix. 

The main idea of the homotopy method is to first cal- 
culate 9{0) and then trace the optimal solution for in- 
creasing values of ijj. The optimal solution w{'ip) of 
the RALP (2) is a piecewise linear function of tp. At 
any point in time, the algorithm keeps track of a set 
of active - or non-zero - variables w and a set of ac- 
tive constraints, which are satisfied with equality. In 
the linear segments, the number of active constraints 
and variables are identical, and the non-linearity arises 
when variables and constraints become active or inac- 
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tive. Therefore, the hnear segments are traced until 
a variable becomes inactive or a constraint becomes 
active. Then, the dual solution is traced until a con- 
straint becomes inactive or a variable becomes active. 

Since the homotopy algorithm solves for the optimal 
value of the RALP (2) for all values of the regulariza- 
tion coefficient ip, it is possible to increase the coeffi- 
cient ip until the error increase between sampled con- 
straints balances out the decrease in the error due to 
the restricted feature space as defined in Theorem 8. 
That is, we can calculate the objective value of the 
linear program (2) for any value of -;/'■ 

It is easy to find ip that minimizes the bounds in this 
section. As the following corollary shows, to find the 
global minimum of the bounds, it is sufficient to use 
the homotopy method to trace 9{'>p) while its derivative 
is less than the increase in the error due to the sam- 
pling {\\v — v\\i^p). Let v{ip) be an optimal solution of 
(7) as a function of the regularization coefficient ip. 

Corollary 10. Assume that edip), ^piip); o,i^d esi'>P) 
are convex functions of tp. Then, the error bound 
\\v{ip) - v*\\i^p < f{ip) for any v{ip) is: 



fW - OW 



T * 
p V 



^c{i') 



,(^) + ep(V;) 



The function f{ip) is convex and its sub- differential^ 
y,pf is independent of v* . Therefore, a global mini- 
mum "0* of f is attained when G V^/(V'*) or when 
ip* = 0. 

The corollary follows directly from Theorem 9 and the 
convexity of the optimal objective value of (2) as a 
function ip. Figure 2 illustrates the functions in the 
corollary. Notice that Proposition 7 is sufficient to sat- 
isfy the conditions of this corollary. In particular, the 
functions es{4'),£p{'4')T^c(jP) ^^^ linear in ip. 

6. Experimental Results 

In this section, we present results indicating that 
RALP effectively selects from rich feature spaces to 
outperform ALP and other common algorithms, such 
as LSPI, on several example problems, including the 
balanced pendulum and the bicycle problems. We also 
demonstrate the speed and effectiveness of the homo- 
topy method in choosing a value of ip. 

Benefits of Regularization First, we demonstrate 
and analyze the properties of RALP on a simple chain 
problem with 200 states, in which the transitions move 
to the right by one step with a centered Gaussian noise 



^Function / may be non-differentiable 



with standard deviation 3. The reward for reach- 
ing the right- most state was -1-1 and the reward in 
the 20th state was -3. This problem is small to en- 
able calculation of the optimal value function and to 
control sampling. We uniformly selected every fourth 
state on the chain and estimated the sampling bound 
tp{ip) = 0.05V'. The approximation basis in this prob- 
lem is represented by piecewise linear features, of the 
form (p{si) = [i — c] , , for c from 1 to 200; these fea- 
tures were chosen due to their strong guarantees for 
the sampling bounds. The experimental results were 
obtained using the proposed homotopy algorithm. 

Figure 3 demonstrates the solution quality of RALP 
on the chain problem as a function of the regulariza- 
tion coefficient ip. The figure shows that although the 
objective of RALP keeps decreasing as -0 increases, 
the sampling error overtakes that reduction. It is clear 
that a proper selection of V-" improves the quality of the 
resulting approximation. To demonstrate the benefits 
of regularization as it relates to overfitting, we com- 
pare the performance of ALP and RALP as a func- 
tion of the number of available features in Figure 5. 
While ALP performance improves initially, it degrades 
severely with more features. The value V' in RALP is 
selected automatically using Corollary 10 and the sam- 
pling bound of ep{ip) = O.OSi/;. Figure 4 demonstrates 
that RALP may also overfit, or perform poorly when 
the regularization coefficient ip is not selected properly. 

To find the proper value of ip, as described in 
Corollary 10, the problem needs to be solved using 
the homotopy method. We show that the homotopy 
method performs significantly faster than a commer- 
cially available linear program solver Mosek. Figure 6 
compares the computational time of homotopy method 
and Mosek, when solving the problem for multiple val- 
ues of Ip in increments of 0.5 on the standard mountain 
car problem (Barto & Sutton, 1998) with 901 piece- 
wise linear features and 6000 samples. Even for any 
single value ip, the homotopy method solves the linear 
program about 3 times faster than Mosek. The next 
two experiments, however, do not use the homotopy 
method. In practice, RALP often works much better 
than what is suggested by our bounds, which can be 
loose for sparsely sampled large state spaces. In the 
following experiments, we determined ip empirically by 
solving the RALP for several different values of ip and 
selecting the one that produced the best policy. This 
was practical because we could solve the large RALPs 
in just a few minutes using constraint generation. 

Inverted Pendulum We now offer experimental re- 
sults demonstrating RALP's ability to create effec- 
tive value functions in balancing an inverted pen- 
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Figure 3. Comparison of the ob- 
jective value of RALP with the 
true error. 
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Figure 4. Comparison of the per- 
formance RALP with two values 
of V' and the one chosen adap- 
tively (Corollary 10). 
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Figure 5. Average of 45 runs of 
ALP and RALP as a function 
of the number of features. Co- 
efficient if) was selected using 
Corollary 10. 
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Figure 6. Comparison of perfor- 
mance of homotopy method ver- 
sus Mosek as a function of V' in 
the mountain car domain. 
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Figure 7. RALP performance on 
pendulum as a function on the 
number of episodes. 
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Figure 8. RALP performance on 
bicycle as a function on the num- 
ber of episodes. 



dulum, a standard benchmark problem in reinforce- 
ment learning (Wang et al., 1996; Lagoudakis & Parr, 
2003). Samples of the form (s, a, r, s') were drawn from 
every action on states drawn from random trajectories 
with the pendulum starting in an upright state, re- 
ferred to as episodes. Features were kernels on 650 
states randomly selected from the training data, con- 
sisting of Gaussian kernels of standard deviation 0.5, 
1, and 1.5, and a 6*'' degree polynomial kernel. ip 
was 1.4, and an average of 25 features had non-zero 
weights. 

The constraints in the RALP were based on a single 
sample for each state and action pair. The policy was 
evaluated based on the number of steps it could bal- 
ance the pendulum, with an upper limit of 3000. This 
served to evaluate the policy resulting from the ap- 
proximate value function. We plot the average num- 
ber of steps the pendulum was balanced as a func- 
tion of the number of training episodes in Figure 7, 
as an average of 100 runs. It is clear the controller 
produced by RALP was effective for small amounts of 
data, balancing the pendulum for the maximum num- 
ber of steps nearly all of the time, even with only 50 
training episodes. Similarly, it was able to leverage 
the larger number of available features to construct an 
effective controller with fewer trajectories than LSPI, 
which needed 450 training episodes before achieving 



an average of 2500 balanced steps (Lagoudakis 
2003). 



Parr, 



Bicycle Balancing and Riding We also present 
experimental results for the bicycle problem, in which 
the goal is to learn to balance and ride a bicy- 
cle to a target position (Randl0v & Alstr0m, 1998; 
Lagoudakis & Parr. 2003). This is a challenging 
benchmark domain in reinforcement learning. Train- 
ing data consisted of samples for every action on states 
drawn from trajectories resulting from random actions 
up to 35 states long, similar to the inverted pendulum 
domain. The feature set consisted of monomials up 
to degree 4 on the individual dimensions and products 
of monomials up to degree 3, for a total of 159 fea- 
tures, tp was 0.03, and an average of 34 features had 
nonzero weights. We plot the number of runs out of 
100 in which the bicycle reached the goal region as a 
function of number of training episodes in Figure 8. 
Again, a high percentage of runs were successful, even 
with only 500 training episodes. In comparison, LSPI 
required 1500 training episodes to pass 80% success. It 
is worth pointing out that due to sampling every ac- 
tion at each state, RALP is using more samples than 
LSPI, but far fewer trajectories. 
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7. Conclusion and Related Work 

In this paper, we introduced Li-regularized Approxi- 
mate Linear Programming and demonstrated its prop- 
erties for combined feature selection and value func- 
tion approximation in reinforcement learning. RALP 
simultaneously addresses the feature selection, value 
function approximation, and policy determination 
problems; our experimental results demonstrate that 
it addresses these issues effectively for several sample 
problems, while our bounds explain the effects of sam- 
pling on the resulting approximation. 

There are many additional issues that need to be ad- 
dressed. The first is the construction of better bounds 
to guide the sampling. Our bounds explain the behav- 
ior of RALP approximation as it relates to the trade-off 
between the richness of the features with the number 
of available samples, but these bounds may at times 
be quite loose. Future work must identify conditions 
that can provide stronger guarantees. Additionally, 
a data-driven approach which can calculate a tighter 
bound online would be valuable. Finally, our analy- 
sis did not address the conditions that would guaran- 
tee sparse RALP solutions and, therefore, allow more 
computationally efficient solvers. 
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A. Proofs 

A.l. Sampling Bounds 

The following lemmas summarize the basic properties 
of the Bellman operator. We include them without 
proofs, which use standard techniques. 

Lemma 11 (Bellman Operator). Let 1 and v be a 

constant vector of an appropriate size. Then: 

L(w + el) =7el + Lv. 

Lemma 12 (Transitive Feasible are Upper Bound). 
Assume an e-transitive-feasible value function v G 
/C(e). That is: 

V > Lv — el. 



Then: 



V > V* 



1 



■7 



Lemma 13. Assume Assumption 2 and that there ex- 
ists V ^ A4 such that 

\\v -v*\\oa < e. 
Then, there exists v' Cz A4 H IC such that 

2e 



< 



1 



■7 



Theorem 8. [Offline Error Bound] Assume Assump- 
tions 2, 4; o,nd 5. Let v, v, v he the optimal solutions 
of (5), (6), and (7), respectively (see Figure 1). Let 



equalities hold: 



l"*^ ~ ''^*lloo Then, the following in- 



\v — V 



\i,P 



\i,P 



|i,P 



< e 



< e 



< e 



■ 2e,(V') + 2 
2ee(V) 



ep(V') 
1-7 
3£,(V>) + 2ep{ilj) 

1-7 



Proof. To simplify the notation, we omit ip in the no- 
tation of e in the proof. 

Proofof ||'f)-'y*||i,p: 

Let v be the minimizer of raixifj^M ||^ — *^*||oo- Then 
from Lemma 13, Lemma 12, and ||a;||i_p < Halloo there 
exists v' a Mr\lC such that: 



< 



1~7 
From Lemma 12, we have that: 

\\v — U*||l.p ~ p {v — V*) = p V — p V* < p v' — p V* 

< \\v'-v*\\i^p. 
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Proof of ||w-w*||i,p: 

Let v' be the solution (6) but with p^v as the objective 

function. From Lemma 12 we have: 



— ^ * 



1-7 



V > V — 



-1. 



1-7 

The difference between v and v' can be quantified as 
follows, using their same feasible sets and the fact that 
V is minimal with respect to p: 

P^v < p^v + ec< p^v' + ec< p^v' + 2ec. 

Since /C(ep) ^ /C wc also have that p~^v' < p^v. Then, 
using that v £ IC and v >v*: 



V — V + 



_fp_-l _ ^p 



1 — 7 1 — 7 



i,p 



< 



V — V + 



1-7" 



i,p 



1-7 



\ 1-7 / 1-' 



< p"^ (u - V*) + 2 



1-7 



1-7 

< pT (^ _ ^,*) + 2e, + 2-^^ 

1-7 

< ||i)-i;*||i,p + 2e, + 2. 



1-7 



Proofof ||w-t;*||i,p: 

Let v' be the solution (7) but with p as the objective 

function. Using Lemma 12 wc have: 

V > Lv > Lv ~ Esl > Lv — (eg + Cp)! 
e.s + ep 1 



V > V 



1-7 



All of these inequalities also hold for v' since it is also 
feasible in (7). From Lemma 11: 



1-7 



-leJC. 



Therefore: 



P^v' < p'^v + 



1-7 



The difference between v and v' can be bounded as 
follows, using their same feasible sets and the fact that 
V is minimal with respect to p: 

p^v < p^v + ec< p^v' + Cc < p^v' + 2ec 



Then: 

ll^-«*lli,P< 
< 



^ , ^s ~y~ ^p -, ^s \ ^p -. 
V — V + — ^1 : -1 



* , £s + ep 
V — V + — -1 



1-7 1-7 i,p 

e.s + £p 

1-7 



< p' [v-v* + 



1-7 

(s + fp 



l,p 



1 + 



<p' {v-v*)+2 



1 — 7 / 1 — 7 
Cs + ep 



1-7 



1-7 
Se,, + 2ep 



<pT(v-w*)+2ec + 
< ||t)-t'*||i,p + 2ec + 



1-7 

SEs + 2£p 

1-7 



D 



Theorem 9. [Online Error Bound] Assume 

Assumption 2 and let v £ JC H A4{ip) be an arbitrary 
feasible solution of the estimated ALP (7). Then: 

\\v* - .|kp < p-v - p-v* + e.WO + 2^iM±i^. 

1-7 



Proof. To simplify the notation, we omit ip in the no- 
tation of e in the proof. Using Lemma 12 we have: 

V > Lv > Lv — egl > Lv — [e^ + ep)l 



V > V 



1-7 
Then, using the above: 



-\. 



\v-v lli^p = \\v 



— V 



^s T~ ^p .| ^s ~r ^p I I 



1 — 7 1 — 7 

i — 7 i 

£., + e. 



i,p 
p 



7 



p"^!; - pv* + 2- 



cp 



1-7 

p'v-pv*+ec + 2'-^^^ 
1-7 
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A. 2. Sampling Guarantees 



Let A: : iS — )■ M" be a map of the state-space to an 
Euclidean space. This basically means that the states 
of the MDP can be mapped to an normed vector space. 
This assumption trivially generalizes Assumption 6. 
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Assumption 14. Assume that samples S are avail- 
able. Assume also that features and transitions satisfy 
Lipschitz-type constraints: 



||</'(s)-0(s)||oo<i^0||fc(s)-fc(s)|| 

\ris) ^ ris)\ < KrWHs) - k{s)\\ 
||P(s, af(j), - P(s, a)T0,||^ < Kp\\k{s) - fc(s)|| Va e A 

Here (j)i denotes a vector representation of a feature 
across all states. 

It is easy to show the following: 

Proposition 15. Assume Assumption 6 and that 
\\4>i\\i = 1. Then Assumption 14 holds with the iden- 
tical constants. 

The proof is simple and relies on the trivial ver- 
sion of Holder's inequality to combine \\P{s,a)'^(j)i — 
P(s, a)''' 01 1 loo with \\4>t\\i. 

While Assumption 14 characterizes general properties 
of the MDP, the following assumption unifies the as- 
sumptions on the MDP and the sampling. 

Assumption 16 (Sufficient Sampling). Assume that 
for Vs G iS, there exists 3s G S, such that: 

. Uis) - (j>is)\\oo < S^ 

• |r(s) — r{s)\ < 6r 

• ||P(s,a)T0, -P(s,a)V.||oo<'5p Va G ^ 

where P(s, a) represents the vector of transition prob- 
abilities from states s given action a. 

Theorem 17. Assume Assumption 2 and 
Assumption 16. Let the value function v he rep- 
resented as V ^ Ad such that \\x\\i < 4'- Then 
Assumption 4 is satisfied with the following con- 
straints: 



Then we get, using Holder's inequality that: 



min(u — Lv){s) > 

> min(w — Lv)(s) — {v — Lv){s) 
>- max |((</.(s) -7P(s,a)"^$) 

sES,a£A 



5r + -0(^0 + jSp) 



The theorem also holds if a column of ^ is 1 and the 
corresponding element of x is not included in the norm. 



- ((/>(s) - 7P(s, ay^)x) - r{s) + r(s)| 

- max ||((/.(s)-7P(s,a)T<i>) 

sES,a£A 

- {4>{s) ~ jPis, a)''<^)\U\x\\, + |r(s) + r{s)\ 

- max ||((/.(s)-7P(s,a)Tci>) 

s£S,aeA 

-(0(s)-7P(s,a)T$)||^i^ + 5, 
max ||((^(s)-0(s)||i-|-7 

sES,aGA 

-(0(s)-P(s,a)T$)||^^-H5, 

-(<50 + Sp)tp - Sr 



>- max \ms) - 0(s)||i +7||P(s,a)T$) 

sES,aGA 



> 



D 



The following then summarizes the results. 

Theorem 7. Assume Assumption 6 and that for 

any s ^ S there exists a state s G S such that \\s — s\\ < 
c. Then Assumption 4 and Assumption 5 hold with 
ep{ijj) = cKr -\- c-ip{K^ -I- jKp) 

Proof. The corollary follows simply by setting the fol- 
lowing values. 

Sc/, = cKc/, 

5r = cKr 

5p = cKp 

n 



Proof. From the assumptions in the theorem, we have 
that V > Lv and we need to show that v > Lv — epl. 



Notice that for simplicity, we did not provide any 
bounds on Ec- These bounds require additional as- 
sumptions on the sampling procedure. That is the 
sampling must not only cover the whole space, but 
must also be uniformly distributed over it. Without 
such distribution, the objective function p must be a 
weighted sum of the states. 

B. Homotopy Continuation Method 

The homotopy algorithm is similar to sensitivity anal- 
ysis of the standard simplex algorithm. However, the 
basic feasible solutions in simplex are of the size of 
the number of variables. Because we are interested in 
solving very large linear programs, this is often im- 
practical. The homotopy method we propose instead 



Feature Selection Using Regularization 



relies on basic basic feasible solutions with size that 
corresponds to the number of variables. 

As before, we use 1,; to denote a zero vector with i-th 
element set to 1. For a matrix A, we use Aj to denote 
the j-th row and A.i as i-the column. We also use x{i) 
to denote the i-the element of the vector. We derive 
the algorithm for a generic linear program, defined as 
follows: 



s.t. 



C X 



Ax>b 
e X < ijj 
x>Q 



(8) 



Note that the variables are constrained to be non- 
negative. Any unbounded variable z can be expressed 
as z = z+ — Z-, such that z+,Z- > 0. The homotopy 
algorithm also relies on the dual formulation of the 
linear program (8), which is as follows. 



max by — tpX 
s.t. A^y ~ eX < c 



(9) 



2/,A>0 



The algorithm traces the optimal solution of the linear 
program (8) as a function of -0, which is defined as 
follows. 

Definition 18. The optimal solution of the linear pro- 
gram (8) as a function of -0 is denoted as x{ip), assum- 
ing that the optimal solution is a singleton. Notice 
that this is the optimal solution, not the optimal ob- 
jective value. We also use y{ip) and X{ip) similarly to 
denote the sets of optimal solutions of the dual pro- 
gram (9) for the given the regularization coefficient. 

The homotopy algorithm keeps a set of active vari- 
ables B{x, y) and a set of active constraints C{x, y). A 
variable is considered to be active if it is non-zero. A 
constraint is considered to be active when the corre- 
sponding dual value is non-negative. Active and inac- 
tive variables and constraints are formally defined as 
follows. 



B{x,y) 
C{x,y) 



{i\x{i) > 0} 
{j I yU) > 0} 



J\f{x, y) 
V{x,y) 



{i\x{i) =0} 

{i|y(i)-o} 



We use B in place of B{x, y) when the values of x, y 
are apparent from the context. Notice that this def- 
inition is intentionally ambiguous, that is for a given 
value of x, y, the active variables and constraints are 
not uniquely specified. This is intentional, to allow for 
adding and removing them at the points of disconti- 
nuity. Although, active primal and dual variables may 



be 0, the inactive variables always must be 0. The 
active variables and constraints can be used to define 
the following variables. 



be 
b-D 



& = ( - 1 2/ = ( ^M A 
We assume the given order of variables 



XB 



Abc 
Abt) 



Aj\fv 



The following assumptions are needed in order to de- 
rive the algorithm. 

Assumption 19. The optimal solution of (8) is feasi- 
ble and bounded for values oi ip € [0, oo). In addition, 
it is "easy" to solve for i/; = 0. 

Assumption 20. For any ip e [0, oo) the solution of 
(8) is not degenerate. 

These assumptions guarantee that at no point the solu- 
tions become degenerate. This is a common assump- 
tion in simplex algorithms, and can be remedied for 
example by assuming a small perturbation of vari- 
ables (Vanderbei, 2001). 

The homotopy algorithm traces the optimal solution 
of the linear program, which can be characterized by 
a set of linear equations. These optimality conditions 
for (8) can be derived identically from KKT or from 
the complementary slackness optimality conditions as 
follows. 



-e^x > 



-0 



Ax>b 

A'^y <c + Xe 

y'^ib -Ax)=0 

X{e^ - V) = 

x'^{c-A'^y + eX) =0 

x,y,X>0 

Without loss of generality, we assume that the active 
constraints and variables are the first in the respective 
structures. This docs not impose any limitations, and 
could be expressed generally using a permutation ma- 
trix P. We implicitly assume that the regularization 
vector e is partitioned properly for the active variables. 

For a given set of active variables and constraints, and 
using the fact that the inactive variables x and y are 
0, the optimality conditions can be then rewritten as: 



e^XB = "0 




Abcxb = be 


Abvxb < b-D 


AjicVc =CB + Xe 


AjjcVc < 


x,y,X>0 
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We are assuming here that the regularization con- 
straint is active. If it becomes inactive at "0, the solu- 
tion is optimal for any value oi ip > ip. The equalities 
follow from the complementarity conditions, omitted 
here. Notice that other constraints may also hold with 
equality. 

The homotopy algorithm is included in Algorithm 1. 
The primal update of the algorithm traces the solution 
in the linear segments and the dual updates determines 
the update direction the sections with non-linearities. 
The algorithm implementation is written with the fo- 
cus on simplicity; an efficient implementation relies on 
factorization of the matrices. Two homotopy continu- 
ation methods — DASSO (James et al., 2009) and pri- 
mal dual pursuit (Asif, 2009) — have been proposed 
for the Dantzig selector. These homotopy methods 
arc very efficient when the problems have sparse so- 
lutions, as is often the case with Dantzig selectors. 
RALP cannot be solved directly using the methods 
for the Dantzig selector, however, because of its differ- 
ent structure. This structure can be used to develop 
a different homotopy method for solving RALP, which 
is described in Appendix B. In addition, a method 
based on parametric linear program solvers has been 
developed for regularized linear programs. However, 
the parametric simplex algorithm cannot take advan- 
tage of sparse RALP solutions and therefore is not 
applicable to large RALP. The finite-time convergence 
of Algorithm 1 can be, however, shown identically to 
DASSO, or other related algorithms. 
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1 ^0^0; 

// Find an initial feasible solutions 

2 a;° ^ a;(V'°) ; 

3 y" ^ y(V'°) ; 

// Determine the initial active sets, and set J\f and 2? to be their complements 

4 B^ = {i\x^ > Q) C^ -^ {i\v{i) > 0} // The regularization constraint is active, or the 

solution is optimal 

5 while f/;* < -0 and A* > do 
i ^^ i + 1 ; 

// Here \C\ = \B\+2 
11 Calculate the space (line) of dual solutions — the update direction 

. ^j ) ^ null (^g|^ e) such that y'-^^{ijj) = ^ Ay*(V') > ; // This is always possible 

because there is always at most one such constraint, given the assumptions. 

// Decide based on a potential variable improvement 

// Calculate the maximum length of the update r, breaking ties arbitrarily. 

ti <— (maxfcgc ~ i-i ) ; // Some y becomes 0. 

( -(Aj^c^y') \ 
t2 <— va&yiktzjsf -r^ i_is '^ J ; // Some x needs to be added to the active set. 



9 
10 
11 

12 
13 

14 
15 

16 

17 

18 



19 



21 



23 

24 

25 
26 



^3 ^ -AA ' // Regularization constraint 

T = min {ti, ^2, is} // Resolve the non-linearity update, where Ki is the set of 

maximizers for </ 
if r = <i then 

else if T = t2 then 

|_ B' ^ B'-^ U K2, Af' ^ {B'f 

else if T = ^3 then 
I The regularization constraint is inactive, return the solution. 

// Update the dual solutions 

y' ^ y'-i + TAy\ A' ^ A*-i + tAA* ; 

// Here \C\ = \B\ + 1 

// Calculate the update direction 

Abc\ "V 



Aa- , , , , 

eg J \A?/; 

// Calculate the maximum length of the update t, breaking ties arbitrarily. 

^4 <— Imaxfcgp T.°.^-i^i, ) ; // A constraint becomes active 

tg 4- (maxfcgB-T-i^) ; // A variable t = mmiti,^} 



II Update the primal solutions 

x^ ^— Xi-i + tAx^ ; 

// Resolve the non-linearity update, where Ki is the set of maximizers for ti 

if T = ^4 then 

else if T = ^5 then 

\_B' ^B'-^\K5, N'^[B^f 



\C 



Algorithm 1. Homotopy Continuation Method for Solving ALP 
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